STAN: Structural Analysis for Web Documents
Abstract
In this paper we present STAN, a structural analysis tool that classifies web documents and, at the same time, extracts meaningful information from them. The extraction and classification rules are defined in terms of a structural grammar that operates on both the layout properties and the content properties of a document. STAN accepts HTML as input and processes documents at several MB/s, depending on the complexity of the structural grammar used.
Similar resources
Using semantic web technologies for analysis and validation of structural markup
An increasing part of research in the Semantic Web has been directed at making data become the main concept of the web. Plenty of languages and specifications support this transition and work by inserting additional (semantic) markup into web documents. Yet, little attention is being paid to the possibility of expressing the actual structures of the documents in a form suitable for the semantic...
The Impact of Ontology on the Performance of Information Retrieval : A Case of
The large amount and heterogeneity of XML documents on the Web requires the development of clustering techniques to group together similar documents. Documents can be grouped together according to their content, their structure, and the links inside and among the documents. For instance, grouping together documents with similar structure has interesting applications in the context of informatio...
Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis
Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically ann...
A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity
Since the emergence in the popularity of XML for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. Therefore it is a new challenge for the field of data mining to turn these documents into a more useful information utility. We present a novel clustering algorithm PCXSS that keeps the heterogeneous XML documents into various groups according ...
Multi-objective Optimization of web profile of railway wheel using Bi-directional Evolutionary Structural Optimization
In this paper, multi-objective optimization of railway wheel web profile using bidirectional evolutionary structural optimization (BESO) algorithm is investigated. Using a finite element software, static analysis of the wheel based on a standard load case, and its modal analysis for finding the fundamental natural frequency is performed. The von Mises stress and critical frequency as the proble...